Algorithms for Molecular Biology
Springer Science and Business Media LLC
All preprints are ranked by how well they match Algorithms for Molecular Biology's content profile, based on 15 papers previously published there. The average preprint has a 0.00% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Ingels, F.; Robidou, L.; Martayan, I.; Marchet, C.; Limasset, A.
High-throughput sequence analysis commonly relies on k-mers (words of fixed length k) to remain tractable at modern scales. These k-mer-based pipelines can employ a sampling step, which in turn allows grouping consecutive k-mers into larger strings to improve data locality. Although other sampling strategies exist, local schemes have become standard: such schemes map each k-mer to the position of one of its characters. A key performance measure of these schemes is their density, defined as the expected fraction of selected positions. The most widely used local scheme is the minimizer scheme: given an integer m ≤ k, a minimizer scheme associates each k-mer with the starting position of one of its m-mers, called its minimizer. Being a local scheme, the minimizer scheme guarantees covering all k-mers of a sequence, with a maximal distance between selected positions of w = k − m + 1. Recent works have established near-tight lower bounds on achievable density under standard assumptions for local schemes, and state-of-the-art schemes now operate close to these limits, suggesting that further improvements under the classical notion of density will face diminishing returns. Hence, in this work, we revisit the notion of density and broaden its scope. As a first contribution, we draw a link between density and the distance between consecutive selected positions. We propose a probabilistic model allowing us to establish that the density of a local scheme is exactly the inverse of the expected distance between the positions it selects, under the minimal and only assumption that said distances are identically distributed. We emphasize that our model makes no assumptions about how positions are selected, unlike the classical models in the literature. Our result introduces a novel method for computing the density of a local scheme, extending beyond classical settings.
Based on this analysis, we introduce a novel technique, named multiminimizers, by associating each k-mer with a bounded set of candidate minimizers rather than a single one. The candidate furthest away (in a precise sense defined in the article) is selected. Since the decision is made by taking advantage of a context beyond a single k-mer, this technique is not a local scheme; it belongs to a novel category of meta schemes. Applying the multiminimizer trick to a local scheme reduces its density at the expense of a controlled increase in computation time. We show that this method, when applied to random (hash-based) minimizers and to open-closed mod-minimizers, approaches a density of [Formula], representing, to our knowledge, the first construction converging to this limit. Our third contribution is the introduction of the deduplicated density, which measures the fraction of distinct minimizers used to cover all k-mers of a set of sequences. While this problem has gained traction in applications such as assembly, filtering, and pattern matching, standard minimizer schemes are often used as a proxy, blurring the distinction between the two objectives (minimizing the number of selected positions versus the number of selected minimizers). Although related to the classical notion of density, deduplicated density differs in both definition and suitable constructions, and must be analyzed in its own right, together with its precise connections to standard density. We show that multiminimizers can also improve this metric, but that globally minimizing deduplicated density in this setting is NP-complete, and we instead propose a local heuristic with strong empirical behavior. Finally, we show that multiminimizers can be computed efficiently, and provide a SIMD-accelerated Rust implementation together with proofs of concept demonstrating reduced memory footprints on core sequence-analysis tasks.
We conclude with open theoretical and practical questions that remain to be addressed in the area of density.
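The density-versus-gap relationship described above can be checked empirically with a small simulation. The sketch below is our own illustration, not the authors' code: a hash-based random minimizer with illustrative parameters k = 21, m = 11 selects, for each k-mer, the position of its minimal m-mer under a random order, then compares the density against the inverse of the mean distance between consecutive selected positions.

```python
import random

rng = random.Random(1)
seq = ''.join(rng.choice('ACGT') for _ in range(100_000))
k, m = 21, 11
w = k - m + 1  # number of m-mer positions per k-mer, and the max gap

prio = {}
def h(mmer):
    # lazily assign a random priority to each m-mer (a random order)
    if mmer not in prio:
        prio[mmer] = rng.random()
    return prio[mmer]

# each k-mer starting at i selects the position of its minimal m-mer
selected = sorted({
    min(range(i, i + w), key=lambda p: h(seq[p:p + m]))
    for i in range(len(seq) - k + 1)
})
density = len(selected) / (len(seq) - k + 1)
gaps = [b - a for a, b in zip(selected, selected[1:])]

# covering guarantee: consecutive selected positions are at most w apart,
# and density is (approximately) the inverse of the mean gap
```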
Korchmaros, A.; Hellmuth, M.; Ramirez-Rafael, J. A.; Schmidt, B.; Stadler, P. F.; Thekkumpadan Puthiyaveedu, S.
Horizontal gene transfer (HGT) is an important contributor to evolution. Following Walter M. Fitch, two genes are xenologs if at least one HGT event separates them. More formally, the directed Fitch graph has a set of genes as its vertices, and directed edges (x, y) for all pairs of genes x and y for which y has been horizontally transferred at least once since it diverged from the last common ancestor of x and y. Subgraphs of Fitch graphs can be inferred by comparative sequence analysis. In many cases, however, only partial knowledge about the "full" Fitch graph can be obtained. Here, we characterize Fitch-satisfiable graphs, which can be extended to a biologically feasible "full" Fitch graph, and derive a simple polynomial-time recognition algorithm. We then proceed to show that several versions of finding the Fitch graph with total maximum (confidence) edge-weights are NP-hard. In addition, we provide a greedy heuristic for "optimally" recovering Fitch graphs from partial ones. Somewhat surprisingly, even if ~80% of the information in the underlying input Fitch graph G is lost (i.e., the partial Fitch graph contains only ~20% of the edges of G), it is possible to recover ~90% of the original edges of G on average.
B. Rocha, L.; Adi, S. S.; Araujo, E.
In computational biology, mapping a sequence s onto a sequence graph G poses a significant challenge. One possible approach to tackling this problem is to find a walk p in G that spells a sequence most similar to s. This challenge is formally known as the Graph Sequence Mapping Problem (GSMP). In this paper, we delve into an alternative problem formulation known as the De Bruijn Graph Sequence Mapping Problem (BSMP). Both problems have three variants: changes only in the sequence, changes only in the graph, and changes in both the sequence and the graph. We concentrate on the variant involving changes in the graph. In the literature, when this problem does not allow the De Bruijn graph to induce new arcs after changes, it is NP-complete, as proven by Gibney et al. [4]. However, we reformulate the problem by considering the characteristics of the arcs induced in the De Bruijn graph. This reformulation alters the problem definition, thereby enabling the application of a polynomial-time algorithm for its resolution. Approaching the problem with this arc-inducing characteristic is new, and the algorithm proposed in this work is new to the literature.
Rahman, M. K.; Rahman, M. S.
The genome rearrangement problem computes the minimum number of operations required to sort all elements of a permutation. A block-interchange operation exchanges two blocks of a permutation which are not necessarily adjacent, and in a prefix block-interchange, one block is always a prefix of the permutation. In this paper, we focus on applying prefix block-interchanges to binary and ternary strings. We present upper bounds for grouping and sorting a given binary/ternary string. We also provide upper bounds for a different version of the block-interchange operation, which we refer to as the restricted prefix block-interchange. We observe that our upper bound for restricted prefix block-interchange operations on binary strings is better than those of other genome rearrangement operations for grouping fully normalized binary strings. Consequently, we provide a linear-time algorithm to solve the problem of grouping binary normalized strings by restricted prefix block-interchanges. We also provide a polynomial-time algorithm to group normalized ternary strings by prefix block-interchange operations. Finally, we provide a classification of ternary strings based on the required number of prefix block-interchange operations.
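As a concrete illustration of the operation studied above, here is a minimal sketch (the function name and index convention are ours, not the paper's) of a prefix block-interchange on a string, together with a one-move grouping of the binary string 0101:

```python
def prefix_block_interchange(s, i, j, k):
    """Exchange the prefix block s[0:i] with the block s[j:k],
    where 0 < i <= j < k <= len(s) (the blocks do not overlap)."""
    assert 0 < i <= j < k <= len(s)
    return s[j:k] + s[i:j] + s[0:i] + s[k:]

# grouping "0101" (making equal symbols adjacent) in a single move:
# exchange the prefix "0" with the final "1"
grouped = prefix_block_interchange("0101", 1, 3, 4)  # -> "1100"
```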
Vasei, H.; Foroughmand-Araabi, M.-H.; Daneshgar, A.
Tumor mutation trees are the primary tools to model the evolution of cancer. Not only may some tumor phylogeny inference methods produce a set of trees encoding potential, parallel evolutionary histories, but mutation trees from different patients may also exhibit similar evolutionary processes. When a set of correlated mutation trees is available, compressing the data into a single best-fit tree exhibiting the shared evolutionary processes is of great importance and can be beneficial in many applications. In this study, we present a general setup to study and analyse the problem of finding a best-fit (centroid) tree to a given set of trees, and we use this setup to analyse mutation trees as our main motivation. To this end, let ε : T_n → R^{n×n} be an embedding of labeled rooted trees on n vertices into the space of real square matrices, and let L be a norm on this space. We introduce the nearest mapped tree problem as the problem of finding a closest tree to a given matrix A with respect to ε and L, i.e., a tree T*(A) for which L(ε(T*(A)) − A) is minimized. Within this setup, our candidates for the embedding are the adjacency, ancestry, and distance matrices of trees, and we consider the L1 and L2 norms in our analysis. We show that the function d(T1, T2) = L(ε(T1) − ε(T2)) defines a family of dissimilarity measures, covering previously studied parent-child and ancestor-descendant metrics. We also show that the nearest mapped tree problem is polynomial-time solvable for the adjacency matrix embedding and is NP-hard for the ancestry and the distance embeddings. The weighted centroid tree problem for a given set of k trees is naturally defined as a nearest mapped tree solution to a weighted sum of the corresponding matrix set.
In this article we consider uniform weighted sums in which all weights are equal; in particular, the (classical) centroid tree is defined to be a solution when all weights are chosen to be equal to 1/k (i.e., the mean case). Similarly, the ω-weighted centroid tree is a solution when all weights are equal to ω/k. To show the generality of our setup, we prove that the solution sets of the centroid tree problem for the adjacency and the ancestry matrices are identical to the solution sets of the consensus tree problem for the parent-child and ancestor-descendant distances, already handled by the algorithms GraPhyC (2018) and TuELiP (2023), respectively. Next, to tackle this problem in some new cases, we provide integer linear programs to handle the nearest mapped tree problem for the ancestry and the distance embeddings, giving rise to solutions of the weighted centroid tree problem in these cases. To show the effectiveness of this approach, we provide an algorithm, WAncILP2, to solve the 2-weighted centroid tree problem for the case of the ancestry matrix. We justify the importance of the weighted setup by the pioneering performance of WAncILP2 both in a comprehensive simulation analysis and on a real breast cancer dataset, in which, by finding centroids as representatives of data clusters, we provide supporting evidence that some common aspects of these centroids can be considered suitable candidates for reliable evolutionary information in relation to the original data.
Brand, M.; Tran, N. K.; Spohr, P.; Schrinner, S.; Klau, G. W.
We consider the homo-edit distance problem: the minimum number of homo-deletions or homo-insertions needed to convert one string into another. A homo-insertion is the insertion of a string of equal characters into another string; a homo-deletion is the inverse operation. We show how to compute the homo-edit distance of two strings in polynomial time: we first demonstrate that the problem is equivalent to computing a common subsequence of the two input strings with a minimum number of homo-deletions, and then present a dynamic programming solution for the reformulated problem. 2012 ACM Subject Classification: Applied computing → Bioinformatics; Applied computing → Molecular sequence analysis; Theory of computation → Dynamic programming
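The full DP is not spelled out here, but one natural ingredient, the minimum number of homo-deletions needed to erase a string entirely, admits a compact interval DP. The sketch below is our own illustration of that subproblem, not the authors' algorithm:

```python
from functools import lru_cache

def min_homo_deletions(s):
    """Minimum number of homo-deletions (each removes a contiguous run of
    equal characters) needed to erase s entirely."""
    @lru_cache(maxsize=None)
    def erase(i, j):  # cost of erasing s[i..j]
        if i > j:
            return 0
        best = 1 + erase(i + 1, j)  # delete s[i] in its own operation
        for k in range(i + 1, j + 1):
            if s[k] == s[i]:  # defer s[i] and merge it into s[k]'s deletion
                best = min(best, erase(i + 1, k - 1) + erase(k, j))
        return best
    return erase(0, len(s) - 1)
```

For example, "aabbaa" needs only two homo-deletions: remove "bb", then the merged run "aaaa".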
Martayan, I.; Cazaux, B.; Limasset, A.; Marchet, C.
In this paper, we introduce the Conway-Bromage-Lyndon (CBL) structure, a compressed, dynamic and exact method for representing k-mer sets. Originating from Conway and Bromage's concept, CBL innovatively employs the smallest cyclic rotations of k-mers, akin to Lyndon words, to leverage lexicographic redundancies. In order to support dynamic operations and set operations, we propose a dynamic bit vector structure that draws a parallel with the Elias-Fano scheme. This structure is encapsulated in a Rust library, demonstrating a balanced blend of construction efficiency, cache locality, and compression. Our findings suggest that CBL outperforms existing dynamic k-mer set methods. Unique to this work, CBL stands out as the only known exact k-mer structure offering in-place set operations. Its combined abilities position it as a flexible Swiss-knife structure for k-mer set management. Availability: https://github.com/imartayan/CBL
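The canonical rotation at the heart of CBL can be illustrated in a few lines. This is a naive O(k²) sketch of the idea, written by us for clarity (CBL itself uses an efficient implementation; Booth's algorithm computes the same rotation in linear time):

```python
def smallest_rotation(kmer):
    """Return the lexicographically smallest cyclic rotation of kmer."""
    return min(kmer[i:] + kmer[:i] for i in range(len(kmer)))

# all rotations of a k-mer share one canonical representative,
# which exposes the lexicographic redundancy CBL exploits
assert smallest_rotation("GATT") == smallest_rotation("TTGA") == "ATTG"
```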
Wolff, J.; Backofen, R.; Gruening, B.
Single-cell Hi-C interaction matrices are high-dimensional and very sparse. To cluster thousands of single-cell Hi-C interaction matrices, they are flattened and compiled into one matrix. This matrix can, depending on the resolution, have a few million or even billions of features, and any computation with it is therefore memory-demanding. A common approach to reducing the number of features is to compute a nearest neighbors graph. However, the exact Euclidean distance computation is in O(n²), and therefore we present an implementation of an approximate nearest neighbors method based on locality-sensitive hashing running in O(n). The presented method is able to process a 10 kb single-cell Hi-C data set with 2500 cells using 53 GB of memory, while the exact k-nearest neighbors approach is not computable even with 1 TB of memory.
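As a toy illustration of the locality-sensitive-hashing idea (signed random projections for cosine similarity; this is our sketch, not the paper's implementation, which targets sparse Hi-C matrices at scale):

```python
import random

rng = random.Random(0)
DIM, NBITS = 100, 16
# one random hyperplane per signature bit
planes = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NBITS)]

def lsh_signature(vec):
    """Bit signature of a sparse vector {index: value}; similar vectors
    agree on most bits with high probability, so candidate neighbors can
    be found by bucketing signatures instead of exact distances."""
    return tuple(int(sum(p[i] * v for i, v in vec.items()) > 0)
                 for p in planes)

v = {3: 1.0, 17: 2.5, 42: -1.0}
# positive scaling preserves every projection's sign, hence the signature
assert lsh_signature(v) == lsh_signature({i: 2 * x for i, x in v.items()})
```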
Kille, B.; Groot Koerkamp, R.; McAdams, D.; Liu, A.; Treangen, T.
Motivation: Sampling k-mers is a ubiquitous task in sequence analysis algorithms. Sampling schemes such as the often-used random minimizer scheme are particularly appealing as they guarantee at least one k-mer is selected out of every w consecutive k-mers. Sampling fewer k-mers often leads to an increase in efficiency of downstream methods. Thus, developing schemes that have low density, i.e., have a small proportion of sampled k-mers, is an active area of research. After over a decade of consistent efforts in both decreasing the density of practical schemes and increasing the lower bound on the best possible density, there is still a large gap between the two. Results: We prove a near-tight lower bound on the density of forward sampling schemes, a class of schemes that generalizes minimizer schemes. For small w and k, we observe that our bound is tight when k ≡ 1 (mod w). For large w and k, the bound can be approximated by [Formula]. Importantly, our lower bound implies that existing schemes are much closer to achieving optimal density than previously known. For example, with the current default minimap2 HiFi settings w = 19 and k = 19, we show that the best known scheme for these parameters, the double decycling-set-based minimizer of Pellow et al., is at most 3% denser than optimal, compared to the previous gap of at most 50%. Furthermore, when k ≡ 1 (mod w) and the alphabet size σ goes to ∞, we show that mod-minimizers introduced by Groot Koerkamp and Pibiri achieve optimal density matching our lower bound. Availability and implementation: Minimizer implementations: github.com/RagnarGrootKoerkamp/minimizers; ILP and analysis: github.com/treangenlab/sampling-scheme-analysis
Diseth, A. C.; Puglisi, S. J.
Given a sequence S of subsets of symbols drawn from an alphabet of size σ, a subset rank query srank(i, c) asks for the number of subsets before the ith subset that contain the symbol c. It was recently shown (Alanko et al., Proc. SIAM ACDA, 2023) that subset rank queries on the spectral Burrows-Wheeler transform (SBWT) lead to efficient k-mer lookup queries, an essential and widespread task in genomic sequence analysis. In this paper we design faster subset rank data structures that use small space, less than 3 bits per k-mer. Our experiments show that this translates to new Pareto-optimal SBWT-based k-mer lookup structures at the low-memory end of the space-time spectrum.
Sanaullah, A.; Villalobos, S.; Zhi, D.; Zhang, S.
Traditionally, variations from a linear reference genome were used to represent large sets of haplotypes compactly. In the linear-reference-genome-based paradigm, the positional Burrows-Wheeler transform (PBWT) has traditionally been used to perform efficient haplotype matching. Pangenome graphs have recently been proposed as an alternative to linear reference genomes for representing the full spectrum of variations in the human genome. However, haplotype matches in pangenome-graph-based haplotype sets are not trivially generalizable from haplotype matches in linear-reference-genome-based haplotype sets. Work has been done to represent large sets of haplotypes as paths through a pangenome graph; the graph Burrows-Wheeler transform (GBWT) is one such work. The GBWT essentially stores the haplotype paths in a run-length compressed BWT with compressed local alphabets. Although count and locate queries that are efficient in practice were provided for the GBWT by its original authors, the efficient haplotype matching capabilities of the PBWT have never been shown on the GBWT. In this paper, we formally define the notion of haplotype matches in pangenome-graph-based haplotype sets by generalizing from haplotype matches in linear-reference-genome-based haplotype sets. We also describe the relationship between set maximal matches, long matches, locally maximal matches, and text maximal matches on the GBWT, PBWT, and the BWT. We provide algorithms for outputting some of these matches by applying the data structures of the r-index (introduced by Gagie et al.) to the GBWT. We show that these structures enable set maximal match and long match queries on the GBWT in almost linear time and in space close to linear in the number of runs in the GBWT. We also provide multiple versions of the query algorithms for different combinations of the available data structures. The long match query algorithms presented here even run on the BWT in the same time complexity as on the GBWT, due to their similarity.
Truszkowski, J. M.; Gascuel, O.; Swenson, K.
Given trees T and T* on the same taxon set, the transfer index φ(b, T*) is the number of taxa that need to be ignored so that the bipartition induced by branch b in T is equal to some bipartition in T*. Recently, Lemoine et al. [14] used the transfer index to design a novel bootstrap analysis technique that improves on Felsenstein's bootstrap on large, noisy data sets. In this work, we propose an algorithm that computes the transfer index for all branches b ∈ T in O(n log³ n) time, which improves upon the current O(n²)-time algorithm by Lin, Rajan and Moret [15]. Our implementation is able to process pairs of trees with hundreds of thousands of taxa in minutes and considerably speeds up the method of Lemoine et al. on large data sets. We believe our algorithm can be useful for comparing large phylogenies, especially when some taxa are misplaced (e.g. due to horizontal gene transfer, recombination, or reconstruction errors).
Jain, C.; Gibney, D.; Thankachan, S. V.
Co-linear chaining has proven to be a powerful heuristic for finding near-optimal alignments of long DNA sequences (e.g., long reads or a genome assembly) to a reference. It is used as an intermediate step in several alignment tools that employ a seed-chain-extend strategy. Despite this popularity, efficient subquadratic-time algorithms for the general case where chains support anchor overlaps and gap costs are not currently known. We present algorithms to solve the co-linear chaining problem with anchor overlaps and gap costs in O(n) time, where n denotes the count of anchors. We also establish the first theoretical connection between co-linear chaining cost and edit distance. Specifically, we prove that for a fixed set of anchors under a carefully designed chaining cost function, the optimal anchored edit distance equals the optimal co-linear chaining cost. Finally, we demonstrate experimentally that the optimal co-linear chaining cost under the proposed cost function can be computed orders of magnitude faster than edit distance, and achieves a correlation coefficient above 0.9 with edit distance for closely as well as distantly related sequences.
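For intuition, co-linear chaining can be sketched as a naive quadratic DP. The version below is our simplified illustration, maximizing total anchored length with no overlaps and no gap costs; the paper's contribution is precisely handling overlaps and gap costs while avoiding this quadratic behavior:

```python
def chain_score(anchors):
    """anchors: (x, y, length) exact matches between two sequences.
    Returns the maximum total anchored length over all co-linear chains,
    where chained anchors must follow each other in both sequences."""
    a = sorted(anchors)
    best = []
    for i, (x, y, l) in enumerate(a):
        s = l
        for j, (xj, yj, lj) in enumerate(a[:i]):
            if xj + lj <= x and yj + lj <= y:  # a[j] precedes a[i] in both
                s = max(s, best[j] + l)
        best.append(s)
    return max(best, default=0)

# the middle anchor conflicts with the other two in the second sequence
score = chain_score([(0, 0, 3), (2, 1, 5), (4, 4, 2)])  # -> 5
```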
Kunzmann, P.
Alignment searches are fast heuristic methods to identify similar regions between two sequences. This group of algorithms is ubiquitously used in a myriad of software to find homologous sequences or to map sequence reads to genomes. Often the first step in alignment searches is k-mer decomposition: listing all overlapping subsequences of length k. This article presents a simple integer representation of k-mers and shows how a sequence can be quickly decomposed into k-mers in constant time with respect to k.
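The idea admits a very short sketch: with a 2-bit code per nucleotide, the integer for each k-mer is obtained from the previous one with a shift, an OR, and a mask, i.e., in time independent of k. This is our minimal reconstruction of the technique, not the article's implementation:

```python
ENC = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def kmers_as_ints(seq, k):
    """Rolling 2-bit encoding: each k-mer's integer code is derived from
    the previous one in O(1), independent of k."""
    mask = (1 << (2 * k)) - 1  # keep only the last k symbols
    code = 0
    out = []
    for i, c in enumerate(seq):
        code = ((code << 2) | ENC[c]) & mask
        if i >= k - 1:
            out.append(code)
    return out

# "ACGT" with k = 2 yields codes for AC, CG, GT
codes = kmers_as_ints("ACGT", 2)  # -> [1, 6, 11]
```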
Shur, A.; Tziony, I.; Orenstein, Y.
Minimizers are sampling schemes that are ubiquitous in almost any high-throughput sequencing analysis. Assuming a fixed alphabet of size σ, a minimizer is defined by two positive integers k, w and a linear order ρ on k-mers. A sequence is processed by a sliding window algorithm that chooses in each window of length w + k − 1 its minimal k-mer with respect to ρ. A key characteristic of a minimizer is its density, which is the expected frequency of chosen k-mers among all k-mers in a random infinite σ-ary sequence. Minimizers of smaller density are preferred as they produce smaller samples, which lead to reduced runtime and memory usage in downstream applications. Recent studies developed methods to generate minimizers with optimal and near-optimal densities, but they require explicitly storing k-mer ranks in Ω(2^k) space. While constant-space minimizers exist, and some of them are proven to be asymptotically optimal, no constant-space minimizer has been proven to guarantee lower density compared to a random minimizer in the non-asymptotic regime, and many minimizer schemes suffer from long k-mer key-retrieval times due to complex computation. In this paper, we introduce 10-minimizers, which constitute a class of minimizers with promising properties. First, we prove that for every k > 1 and every w ≥ k − 2, a random 10-minimizer has, in expectation, lower density than a random minimizer. This is the first provable guarantee for a class of minimizers in the non-asymptotic regime. Second, we present spacers, which are particular 10-minimizers combining three desirable properties: they are constant-space, low-density, and have small k-mer key-retrieval time. In terms of density, spacers are competitive with the best known constant-space minimizers; in certain (k, w) regimes they achieve the lowest density among all known (not necessarily constant-space) minimizers. Notably, we are the first to benchmark constant-space minimizers on the time spent for k-mer key retrieval, which is the most fundamental operation in many minimizer-based methods. Our empirical results show that spacers can retrieve k-mer keys in competitive time (a few seconds per genome-size sequence, which is less than required by random minimizers), for all practical values of k and w. We expect 10-minimizers to improve minimizer-based methods, especially those using large window sizes. We also propose the k-mer key-retrieval benchmark as a standard objective for any new minimizer scheme.
Yao, H.-T.; Chauve, C.; REGNIER, M.; Ponty, Y.
The problem of RNA design attempts to construct RNA sequences that perform a predefined biological function, identified by several additional constraints. One of the foremost objectives of RNA design is that the designed RNA sequence should adopt a predefined target secondary structure preferentially to any alternative structure, according to a given metric and folding model. It was observed in several works that some secondary structures are undesignable, i.e. no RNA sequence can fold into the target structure while satisfying some criterion measuring how preferential this folding is compared to alternative conformations. In this paper, we show that the proportion of designable secondary structures decreases exponentially with the size of the target secondary structure, for various popular combinations of energy models and design objectives. This exponential decay is, at least in part, due to the existence of undesignable motifs, which can be generically constructed and jointly analyzed to yield asymptotic upper bounds on the number of designable structures.
Rocha, L. B.; Adi, S. S.; Araujo, E.
In computational biology, mapping a sequence s onto a sequence graph G is a significant challenge. One possible approach to addressing this problem is to identify a walk p in G that spells a sequence which is most similar to s. This problem is known as the Graph Sequence Mapping Problem (GSMP). In this paper, we study an alternative problem formulation, namely the De Bruijn Graph Sequence Mapping Problem (BSMP), which can be stated as follows: given a sequence s and a De Bruijn graph Gk (where k ≥ 2), find a walk p in Gk that spells a sequence which is most similar to s according to a distance metric. We present both exact algorithms and approximate distance heuristics for solving this problem, using edit distance as a criterion for measuring similarity.
Rahmann, S.; Zentgraf, J.
Read mapping (and alignment) is a fundamental problem in biological sequence analysis. For speed and computational efficiency, many popular read mappers tolerate only a few differences between the read and the corresponding part of the reference genome, which leads to reference bias: reads with too many differences are not guaranteed to be mapped correctly or at all, because for a genomic position to even be considered, a sufficiently long exact match (seed) must exist. While pangenomes and their graph-based representations provide one way to avoid reference bias by enlarging the reference, we explore an orthogonal approach and consider stronger substitution-tolerant primitives, namely spaced seeds, or gapped k-mers. Given two integers k ≤ w, one considers k selected positions, described by a mask, from each length-w window in a sequence. In the existing literature, masks with certain probabilistic guarantees have been designed for small values of k. Here, for the first time, we take a combinatorial approach from a worst-case perspective. For any mask, using integer linear programs, we find least favorable distributions of sequence changes in two different senses: (1) minimizing the number of unchanged windows; (2) minimizing the number of positions covered by unchanged windows. Then, among all masks of a given shape (k, w), we find the set of best masks that maximize these minima. As a result, we obtain highly robust masks, even for large numbers of changes. Their advantages are illustrated in two ways: first, we provide a new challenge dataset of simulated DNA reads on which current methods like bwa-mem2, minimap2, or strobealign struggle to find seeds, and therefore cannot produce alignments against the human T2T reference genome, whereas we are able to find the correct location from a few unique spaced seeds. Second, we use real DNA data from the highly diverse human HLA region, which we are able to map correctly based on a few exactly matching spaced seeds of well-chosen masks, without evaluating alignments.
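Extracting spaced seeds for a given mask is straightforward; the sketch below (our illustration, with a toy mask, not one of the optimized masks from the paper) keeps the k selected positions from each length-w window:

```python
def spaced_seeds(seq, mask, w):
    """mask: sorted positions within [0, w) to keep from each window."""
    return [''.join(seq[i + p] for p in mask)
            for i in range(len(seq) - w + 1)]

# mask (0, 2) on windows of length 3: the middle position is a
# "don't care", so a substitution there leaves the seed unchanged
seeds = spaced_seeds("ACGTA", (0, 2), 3)  # -> ['AG', 'CT', 'GA']
```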
O'Shea, R. J.
Motivation: Graph canonisation and isomorphism testing are fundamental computational problems whose complexity status has remained unresolved to date. This study examines graph eigenprojections, demonstrating that linear-ordering transformations induce canonical properties therein, yielding polynomial-time canonisation and isomorphism testing in all undirected graphs. Results: This study presents an exact method to identify analogous vertices in isomorphic graphs through comparison of the vertices' eigenprojection matrices, which are shown to be related by a linear permutation. Systematic perturbation strategies are developed to reduce degeneracy whilst conserving isomorphism, through the addition of characteristically weighted self-loops to analogous vertices. Repeated iterations of analogy testing and perturbation deliver canonical vertex labelling and recovery of isomorphic mappings in [Formula] time in all graphs. Analytical proofs are provided to support the claims, and experimental performance is demonstrated on biological and synthetic data, with comparison to a commonly used heuristic algorithm. Availability and Implementation: Source code is provided at github.com/robertoshea/graph_isomorphism. Contact: robert.1.oshea@kcl.ac.uk Supplementary Data: Not applicable.
Schmidt, S.; Toivonen, S.; Medvedev, P.; Tomescu, A. I.
Despite the long history of genome assembly research, there remains a large gap between theoretical and practical work. There is practical software with little theoretical underpinning of accuracy on one hand, and theoretical algorithms which have not been adopted in practice on the other. In this paper we attempt to bridge the gap between theory and practice by showing how the theoretical safe-and-complete framework can be integrated into existing assemblers in order to improve contiguity. The optimal algorithm in this framework, called the omnitig algorithm, has not been used in practice due to its complexity and its lack of robustness to real data. Instead, we pursue a simplified notion of omnitigs, giving an efficient algorithm to compute them and demonstrating their safety under certain conditions. We modify two assemblers (wtdbg2 and Flye) by replacing their unitig algorithm with the simple omnitig algorithm. We test our modifications using real HiFi data from the Drosophila melanogaster and Caenorhabditis elegans genomes. Our modified algorithms lead to a substantial improvement in alignment-based contiguity, with negligible computational costs and either no or only a small increase in the number of misassemblies.